- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.4k
Q6_K AVX improvements #10118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Q6_K AVX improvements #10118
Conversation
small improvement with shuffle lut, likely because all loads are already done at that stage
| I get these results with  master: PR:  | 
| 
 Considering that most >3bpw quants use an output weight quantized in Q6_K (the Q6_K FTYPE using a Q6_K token embed weight as well), and that not offloading on GPU the output weight is done sometimes to save a bit of VRAM, would such AVX2 implementation reduce the loss of performances incurred when not offloading said output weight on GPU? I'm using a Ryzen 5700G, which is AVX2 I believe. | 
| 
 I mean it should help in theory, but considering how the output weight is only a small part of the full model you probably won't notice a thing. I guess the easiest way to test this is to benchmark a Q4 or Q5 model with this PR. | 
| I had to update the CI target macOS-latest-cmake-x64 to run on macos-13 as macos-12 seems to have been depreciated by Github. The macos-13 machines are still x86 so this won't generate arm binaries (see #7940). | 
| 
 I do. AMD FX is AVX only. | 
* metal : fix minor string leaks (ggml/1004) * cmake : make it possible linking ggml as external lib (ggml/1003) * sync : ggml * CANN: adjust backend registry refactor. (ggml-org#10158) remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR. * metal : move dequantize templates to beginning of MSL source (#0) * metal : simplify f16 and f32 dequant kernels (#0) * cuda : clear error after changing peer access (ggml-org#10153) * fix build break on arm64 linux (ggml-org#10166) This fixes the build break from the recent changes to move the CPU backend to separate files ggml-org#10144 * server : clarify /slots endpoint, add is_processing (ggml-org#10162) * server : clarify /slots endpoint, add is_processing * fix tests * ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggml-org#10167) * ggml : fix gelu tables initialization (ggml-org#10172) * Q6_K AVX improvements (ggml-org#10118) * q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86 * ggml : fix arch check in bf16_to_fp32 (ggml-org#10164) * llama : add <|tool_call|> formatting to Granite template (ggml-org#10177) Branch: GraniteToolCallTemplate Signed-off-by: Gabe Goodhart <[email protected]> * metal : add quantized FA support (ggml-org#10149) * metal : add quantized FA (vec) support ggml-ci * metal : add quantized FA (non-vec) support * metal : fix support check ggml-ci * metal : clean-up * metal : clean-up (cont) * metal : fix shared memory calc + reduce smem + comments * metal : float-correctness * metal : minor [no ci] * ggml : adjust is_first_call init value (ggml-org#10193) ggml-ci * metal : fix from ptr buffer name (ggml-org#10189) * server : remove hack for extra parallel slot (ggml-org#10187) ggml-ci * metal : add BF16 support (ggml-org#8439) * ggml : add initial BF16 support ggml-ci * metal : add mul_mat_id BF16 support ggml-ci * metal : check for bfloat support on the Metal device ggml-ci * metal : better var names [no ci] * metal : do not build bfloat kernels when not supported ggml-ci * metal : try to fix BF16 support check ggml-ci * metal : this should correctly check bfloat support --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]>
* Merge PR (#10) (#11) (#13) Merge --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump requests from 2.31.0 to 2.32.2 in the pip group across 1 directory Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests). Updates `requests` from 2.31.0 to 2.32.2 - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.31.0...v2.32.2) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] <[email protected]> * Temp (#15) * metal : fix minor string leaks (ggml/1004) * cmake : make it possible linking ggml as external lib (ggml/1003) * sync : ggml * CANN: adjust backend registry refactor. (ggml-org#10158) remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR. * metal : move dequantize templates to beginning of MSL source (#0) * metal : simplify f16 and f32 dequant kernels (#0) * cuda : clear error after changing peer access (ggml-org#10153) * fix build break on arm64 linux (ggml-org#10166) This fixes the build break from the recent changes to move the CPU backend to separate files ggml-org#10144 * server : clarify /slots endpoint, add is_processing (ggml-org#10162) * server : clarify /slots endpoint, add is_processing * fix tests * ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggml-org#10167) * ggml : fix gelu tables initialization (ggml-org#10172) * Q6_K AVX improvements (ggml-org#10118) * q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86 * ggml : fix arch check in bf16_to_fp32 (ggml-org#10164) * llama : add <|tool_call|> formatting to Granite template (ggml-org#10177) Branch: GraniteToolCallTemplate Signed-off-by: Gabe Goodhart <[email protected]> * metal : add quantized FA support (ggml-org#10149) * metal : add quantized FA (vec) support ggml-ci * metal : add quantized FA (non-vec) support * metal : fix support check ggml-ci * metal : clean-up * metal : clean-up (cont) * metal : fix shared memory calc + reduce smem + comments * metal : float-correctness * metal : minor [no ci] * ggml : adjust is_first_call init value (ggml-org#10193) ggml-ci * metal : fix from ptr buffer name (ggml-org#10189) * server : remove hack for extra parallel slot (ggml-org#10187) ggml-ci * metal : add BF16 support (ggml-org#8439) * ggml : add initial BF16 support ggml-ci * metal : add mul_mat_id BF16 support ggml-ci * metal : check for bfloat support on the Metal device ggml-ci * metal : better var names [no ci] * metal : do not build bfloat kernels when not supported ggml-ci * metal : try to fix BF16 support check ggml-ci * metal : this should correctly check bfloat support --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]>
* Master1 (#17) * Merge PR (#10) (#11) (#13) Merge --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump requests from 2.31.0 to 2.32.2 in the pip group across 1 directory Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests). Updates `requests` from 2.31.0 to 2.32.2 - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.31.0...v2.32.2) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] <[email protected]> * Temp (#15) * metal : fix minor string leaks (ggml/1004) * cmake : make it possible linking ggml as external lib (ggml/1003) * sync : ggml * CANN: adjust backend registry refactor. (ggml-org#10158) remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR. * metal : move dequantize templates to beginning of MSL source (#0) * metal : simplify f16 and f32 dequant kernels (#0) * cuda : clear error after changing peer access (ggml-org#10153) * fix build break on arm64 linux (ggml-org#10166) This fixes the build break from the recent changes to move the CPU backend to separate files ggml-org#10144 * server : clarify /slots endpoint, add is_processing (ggml-org#10162) * server : clarify /slots endpoint, add is_processing * fix tests * ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggml-org#10167) * ggml : fix gelu tables initialization (ggml-org#10172) * Q6_K AVX improvements (ggml-org#10118) * q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86 * ggml : fix arch check in bf16_to_fp32 (ggml-org#10164) * llama : add <|tool_call|> formatting to Granite template (ggml-org#10177) Branch: GraniteToolCallTemplate Signed-off-by: Gabe Goodhart <[email protected]> * metal : add quantized FA support (ggml-org#10149) * metal : add quantized FA (vec) support ggml-ci * metal : add quantized FA (non-vec) support * metal : fix support check ggml-ci * metal : clean-up * metal : clean-up (cont) * metal : fix shared memory calc + reduce smem + comments * metal : float-correctness * metal : minor [no ci] * ggml : adjust is_first_call init value (ggml-org#10193) ggml-ci * metal : fix from ptr buffer name (ggml-org#10189) * server : remove hack for extra parallel slot (ggml-org#10187) ggml-ci * metal : add BF16 support (ggml-org#8439) * ggml : add initial BF16 support ggml-ci * metal : add mul_mat_id BF16 support ggml-ci * metal : check for bfloat support on the Metal device ggml-ci * metal : better var names [no ci] * metal : do not build bfloat kernels when not supported ggml-ci * metal : try to fix BF16 support check ggml-ci * metal : this should correctly check bfloat support --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Rename build.yml to build-ci.yml * build.yml * Update build-ci.yml * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * Delete ggml/src/vulkan-shaders/CMakeLists.txt * Update build.yml * Update build-ci.yml * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]>
* q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
* merge (#20) * Master1 (#17) * Merge PR (#10) (#11) (#13) Merge --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump requests from 2.31.0 to 2.32.2 in the pip group across 1 directory Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests). Updates `requests` from 2.31.0 to 2.32.2 - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.31.0...v2.32.2) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] <[email protected]> * Temp (#15) * metal : fix minor string leaks (ggml/1004) * cmake : make it possible linking ggml as external lib (ggml/1003) * sync : ggml * CANN: adjust backend registry refactor. (ggml-org#10158) remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR. * metal : move dequantize templates to beginning of MSL source (#0) * metal : simplify f16 and f32 dequant kernels (#0) * cuda : clear error after changing peer access (ggml-org#10153) * fix build break on arm64 linux (ggml-org#10166) This fixes the build break from the recent changes to move the CPU backend to separate files ggml-org#10144 * server : clarify /slots endpoint, add is_processing (ggml-org#10162) * server : clarify /slots endpoint, add is_processing * fix tests * ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggml-org#10167) * ggml : fix gelu tables initialization (ggml-org#10172) * Q6_K AVX improvements (ggml-org#10118) * q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86 * ggml : fix arch check in bf16_to_fp32 (ggml-org#10164) * llama : add <|tool_call|> formatting to Granite template (ggml-org#10177) Branch: GraniteToolCallTemplate Signed-off-by: Gabe Goodhart <[email protected]> * metal : add quantized FA support (ggml-org#10149) * metal : add quantized FA (vec) support ggml-ci * metal : add quantized FA (non-vec) support * metal : fix support check ggml-ci * metal : clean-up * metal : clean-up (cont) * metal : fix shared memory calc + reduce smem + comments * metal : float-correctness * metal : minor [no ci] * ggml : adjust is_first_call init value (ggml-org#10193) ggml-ci * metal : fix from ptr buffer name (ggml-org#10189) * server : remove hack for extra parallel slot (ggml-org#10187) ggml-ci * metal : add BF16 support (ggml-org#8439) * ggml : add initial BF16 support ggml-ci * metal : add mul_mat_id BF16 support ggml-ci * metal : check for bfloat support on the Metal device ggml-ci * metal : better var names [no ci] * metal : do not build bfloat kernels when not supported ggml-ci * metal : try to fix BF16 support check ggml-ci * metal : this should correctly check bfloat support --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Rename build.yml to build-ci.yml * build.yml * Update build-ci.yml * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * Delete ggml/src/vulkan-shaders/CMakeLists.txt * Update build.yml * Update build-ci.yml * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]>
* q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
* Merge (#21) * merge (#20) * Master1 (#17) * Merge PR (#10) (#11) (#13) Merge --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump requests from 2.31.0 to 2.32.2 in the pip group across 1 directory Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests). Updates `requests` from 2.31.0 to 2.32.2 - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.31.0...v2.32.2) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] <[email protected]> * Temp (#15) * metal : fix minor string leaks (ggml/1004) * cmake : make it possible linking ggml as external lib (ggml/1003) * sync : ggml * CANN: adjust backend registry refactor. (ggml-org#10158) remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR. * metal : move dequantize templates to beginning of MSL source (#0) * metal : simplify f16 and f32 dequant kernels (#0) * cuda : clear error after changing peer access (ggml-org#10153) * fix build break on arm64 linux (ggml-org#10166) This fixes the build break from the recent changes to move the CPU backend to separate files ggml-org#10144 * server : clarify /slots endpoint, add is_processing (ggml-org#10162) * server : clarify /slots endpoint, add is_processing * fix tests * ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggml-org#10167) * ggml : fix gelu tables initialization (ggml-org#10172) * Q6_K AVX improvements (ggml-org#10118) * q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86 * ggml : fix arch check in bf16_to_fp32 (ggml-org#10164) * llama : add <|tool_call|> formatting to Granite template (ggml-org#10177) Branch: GraniteToolCallTemplate Signed-off-by: Gabe Goodhart <[email protected]> * metal : add quantized FA support (ggml-org#10149) * metal : add quantized FA (vec) support ggml-ci * metal : add quantized FA (non-vec) support * metal : fix support check ggml-ci * metal : clean-up * metal : clean-up (cont) * metal : fix shared memory calc + reduce smem + comments * metal : float-correctness * metal : minor [no ci] * ggml : adjust is_first_call init value (ggml-org#10193) ggml-ci * metal : fix from ptr buffer name (ggml-org#10189) * server : remove hack for extra parallel slot (ggml-org#10187) ggml-ci * metal : add BF16 support (ggml-org#8439) * ggml : add initial BF16 support ggml-ci * metal : add mul_mat_id BF16 support ggml-ci * metal : check for bfloat support on the Metal device ggml-ci * metal : better var names [no ci] * metal : do not build bfloat kernels when not supported ggml-ci * metal : try to fix BF16 support check ggml-ci * metal : this should correctly check bfloat support --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Rename build.yml to build-ci.yml * build.yml * Update build-ci.yml * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * Delete ggml/src/vulkan-shaders/CMakeLists.txt * Update build.yml * Update build-ci.yml * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Update build-ci.yml * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]>
* Temp (#23) * Merge (#21) * merge (#20) * Master1 (#17) * Merge PR (#10) (#11) (#13) Merge --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump requests from 2.31.0 to 2.32.2 in the pip group across 1 directory Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests). Updates `requests` from 2.31.0 to 2.32.2 - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.31.0...v2.32.2) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] <[email protected]> * Temp (#15) * metal : fix minor string leaks (ggml/1004) * cmake : make it possible linking ggml as external lib (ggml/1003) * sync : ggml * CANN: adjust backend registry refactor. (ggml-org#10158) remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR. * metal : move dequantize templates to beginning of MSL source (#0) * metal : simplify f16 and f32 dequant kernels (#0) * cuda : clear error after changing peer access (ggml-org#10153) * fix build break on arm64 linux (ggml-org#10166) This fixes the build break from the recent changes to move the CPU backend to separate files ggml-org#10144 * server : clarify /slots endpoint, add is_processing (ggml-org#10162) * server : clarify /slots endpoint, add is_processing * fix tests * ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggml-org#10167) * ggml : fix gelu tables initialization (ggml-org#10172) * Q6_K AVX improvements (ggml-org#10118) * q6_k instruction reordering attempt * better subtract method * should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage * optimize bit fiddling * handle -32 offset separately. bsums exists for a reason! * use shift * Update ggml-quants.c * have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86 * ggml : fix arch check in bf16_to_fp32 (ggml-org#10164) * llama : add <|tool_call|> formatting to Granite template (ggml-org#10177) Branch: GraniteToolCallTemplate Signed-off-by: Gabe Goodhart <[email protected]> * metal : add quantized FA support (ggml-org#10149) * metal : add quantized FA (vec) support ggml-ci * metal : add quantized FA (non-vec) support * metal : fix support check ggml-ci * metal : clean-up * metal : clean-up (cont) * metal : fix shared memory calc + reduce smem + comments * metal : float-correctness * metal : minor [no ci] * ggml : adjust is_first_call init value (ggml-org#10193) ggml-ci * metal : fix from ptr buffer name (ggml-org#10189) * server : remove hack for extra parallel slot (ggml-org#10187) ggml-ci * metal : add BF16 support (ggml-org#8439) * ggml : add initial BF16 support ggml-ci * metal : add mul_mat_id BF16 support ggml-ci * metal : check for bfloat support on the Metal device ggml-ci * metal : better var names [no ci] * metal : do not build bfloat kernels when not supported ggml-ci * metal : try to fix BF16 support check ggml-ci * metal : this should correctly check bfloat support --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Rename build.yml to build-ci.yml * build.yml * Update build-ci.yml * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * Delete ggml/src/vulkan-shaders/CMakeLists.txt * Update build.yml * Update build-ci.yml * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Update build-ci.yml * Update build-ci.yml --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]> * Bump the pip group across 2 directories with 2 updates (#24) Updates the requirements on [pillow](https://github.com/python-pillow/Pillow) and [aiohttp](https://github.com/aio-libs/aiohttp) to permit the latest version. Updates `pillow` to 11.0.0 - [Release notes](https://github.com/python-pillow/Pillow/releases) - [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst) - [Commits](python-pillow/Pillow@10.2.0...11.0.0) Updates `aiohttp` to 3.11.7 - [Release notes](https://github.com/aio-libs/aiohttp/releases) - [Changelog](https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst) - [Commits](aio-libs/aiohttp@v3.9.3...v3.11.7) --- updated-dependencies: - dependency-name: pillow dependency-type: direct:production dependency-group: pip - dependency-name: aiohttp dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: apicalshark <[email protected]> * Update build-ci.yml * Update build-ci.yml * Update build-ci.yml * Update build-ci.yml * Update build-ci.yml * Update build-ci.yml * Update build-ci.yml * Update build-ci.yml * Create docker.yml * Create python-lint.yml * Create server.yml * Update requirements.txt --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: dennyxbox890 <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Plamen Minev <[email protected]> Co-authored-by: Yuri Khrustalev <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: Diego Devesa <[email protected]> Co-authored-by: snadampal <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: Eve <[email protected]> Co-authored-by: Gabe Goodhart <[email protected]>
I've updated the Q6_K AVX implementation for better prompt processing and text generation speed. Most of the improvements came from using the
bsumsdata to separately calculate the Q6 offsets (it's precomputed for a reason!), reducing the number of instructions.The same techniques should also be applicable to the AVX2 implementation if someone wants to work on that. It looks like the ARM implementation already uses
bsumscorrectly.By the way I'm kinda curious if anyone here actually uses llama.cpp on an AVX only machine, or if I'm the only one 😞.
Results (Xeon E5 v2):
Master
PR
test-backend-opsandtest-quantize-fnsis passing.